Entity Resolution with Evolving Rules

نویسندگان

  • Steven Euijong Whang
  • Hector Garcia-Molina
چکیده

Entity resolution (ER) identifies database records that refer to the same real world entity. In practice, ER is not a one-time process, but is constantly improved as the data, schema and application are better understood. We address the problem of keeping the ER result up-to-date when the ER logic “evolves” frequently. A naı̈ve approach that re-runs ER from scratch may not be tolerable for resolving large datasets. This paper investigates when and how we can instead exploit previous “materialized” ER results to save redundant work with evolved logic. We introduce algorithm properties that facilitate evolution, and we propose efficient rule evolution techniques for two clustering ER models: match-based clustering and distance-based clustering. Using real data sets, we illustrate the cost of materializations and the potential gains over the naı̈ve approach.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

بهبود شناسایی موجودیت‌های نامدار فارسی با استفاده از کسره اضافه

Named entity recognition is a process in which the people’s names, name of places (cities, countries, seas, etc.) and organizations (public and private companies, international institutions, etc.), date, currency and percentages in a text are identified. Named entity recognition plays an important role in many NLP tasks such as semantic role labeling, question answering, summarization, machine ...

متن کامل

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...

متن کامل

Corpus based coreference resolution for Farsi text

"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

متن کامل

Combining Linguistic Knowledge and Machine Learning for Anaphora Resolution

This thesis focuses on attempting to incorporate linguistically contextual clues into machine learning-based approaches for anaphora resolution, which is the process of identifying whether or not an expression refers to another expression. As the state-ofthe-art of morpho-syntactic analysis and named entity recognition grows more increasingly sophisticated, research focus in natural language pr...

متن کامل

Application of association rules mining to Named Entity Recognition and co-reference resolution for the Indonesian language

In this paper, we propose a new method, association rules mining for Named Entity Recognition (NER) and co-reference resolution. The method uses several morphological and lexical features such as Pronoun Class (PC) and Name Class (NC), String Similarity (SP) and Position (P) in the text, into a vector of attributes. Applied to a corpus of newspaper in the Indonesian language, the method outperf...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • PVLDB

دوره 3  شماره 

صفحات  -

تاریخ انتشار 2010